The outbreak of the novel Corona virus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 755 million cases have been confirmed worldwide, with nearly 6.8 million associated deaths by February 2023. Within the US alone, there have been over 1.1 million deaths and upwards of 102 million cases reported by February of 2023. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population that die due to the the disease.
We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We are providing a portion of the cleaned dataset for this case study.
There are two main goals for this case study:
We aim to show the dynamic evolution of COVID cases and COVID-related deaths at the state level.
We aim to identify county-level demographic and policy interventions that are associated with mortality rates in the US. We will construct models to find possible factors related to county-level COVID-19 mortality rates.
This is a rather complex project, but with our team’s help, we have made your job easier.
Please hide all unnecessary lengthy R-output and keep your write-up neat and readable.
Remark 1: The data and the statistics reported here were collected before February of 2021.
Remark 2: A group of RAs spent tremendous amount of time working together to assemble the data. It requires data wrangling skills.
The data comes from several different sources:
In this case study, we use the following three nearly cleaned data:
Among all data, the unique identifier of county is
FIPS.
The cleaning procedure is attached in
Appendix 2: Data cleaning You may go through it if you are
interested or would like to make any changes.
First read in the data.
# county-level socialeconomic information
county_data <- fread("data/covid_county.csv")
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates
#covid_intervention <- fread("data/covid_intervention.csv")The detailed description of variables is in
Appendix 1: Data description. Please get familiar with the
variables. Summarize the two data briefly.
It is crucial to decide the right granularity for visualization and analysis. We will compare daily vs weekly total new cases by state and we will see it is hard to interpret daily report.
covid_rate are cumulative by county. To calculate the
number of new cases by state, first use
group_by(FIPS), arrange(date), and
lag(cum_cases, default = 0) function to get lagged cases,
and then calculate daily new cases by subtracting cum_cases
by the lagged cum_cases. Finally, aggregate the daily new
cases by State and date to get the daily new
cases by state.)library(data.table)
library(dplyr)
library(ggplot2)
# Read in the data
covid_rate <- fread("data/covid_rates.csv")
# Convert date column to Date format
covid_rate[, date := as.Date(date)]
# Get daily new cases by county (FIPS)
daily <- covid_rate %>%
group_by(FIPS) %>%
arrange(date) %>%
mutate(daily_new_case = cum_cases - lag(cum_cases, default = 0))
# Get daily new cases by state
daily_state_cases <- daily %>%
group_by(State, date) %>%
summarise(daily_new_case = sum(daily_new_case, na.rm = TRUE), .groups = "drop")
# Filter for NY, WA, and FL
states_of_interest <- c("New York", "Washington", "Florida")
filtered_states <- daily_state_cases %>%
filter(State %in% states_of_interest)
# Plot daily new cases by state
ggplot(filtered_states, aes(x = date, y = daily_new_case, color = State)) +
geom_line() +
labs(title = "Daily New COVID Cases in NY, WA, and FL",
x = "Date",
y = "Daily New Cases") +
theme_minimal()The problem with using single-day data is high volatility(spikes and dips in daily pattern), it may caused by the delays in reporting. Washington has a smoother trend compared to NY and FL. Florida shows large one-day jumps. New York also has sharp peaks, especially in early 2020. Florida and New York show higher volatility, suggesting inconsistent or delayed reporting practices.
Biggest issue of using single day data is that there are many noises and too much fluctuation and is hard to interpret a general trend and summarize what is happening. Delays and artificial spikes can mislead decision-making.
weekly_case_per100k by state. Plot the spaghetti plots of
weekly_case_per100k by state. Use
TotalPopEst2019 in county_data as population.
(Hint: similar to i), first calculate the daily new cases by county, and
then calculate the weekly new cases by aggregating the daily new cases
by State and week. Then calculate the total
population by State in county_data.)# Get weekly new cases by state
weekly_state_cases <- daily %>%
group_by(State, week) %>%
summarise(weekly_new_cases = sum(daily_new_case, na.rm = TRUE), .groups = "drop")
# Get state population
pop_state <- covid_rate %>%
distinct(FIPS, State, TotalPopEst2019) %>%
group_by(State) %>%
summarize(population = sum(TotalPopEst2019, na.rm = TRUE))
# Join weekly state cases with population
weekly_state_cases <- weekly_state_cases %>%
left_join(pop_state, by = "State") %>%
mutate(weekly_case_per100k = (weekly_new_cases / population) * 100000)
# Plot weekly new cases per 100k by state
ggplot(weekly_state_cases, aes(x = week, y = weekly_case_per100k, group = State, color = State)) +
geom_line() +
labs(title = "Weekly New COVID Cases per 100k by State",
x = "Week",
y = "Weekly Cases per 100k") +
theme_minimal()The plot shows multiple peaks with an overall increasing trend.Sharp increases are followed by declines, that’s might caused by the holiday seasons. The general rise suggests sustained transmission. The differences between states could be due to varying policies, population density, and other factors. The later peaks were likely driven by relaxed mask mandates and reduced public concern about the virus’s threat to life.
covid_intervention to see whether the
effectiveness of lock down in flattening the curve.limits argument in scale_fill_gradient() or
use facet_wrap(); use lubridate::month() and
lubridate::year() to extract month and year from date; use
tidyr::complete(state, month, fill = list(new_case_per100k = NA))
to complete the missing months with no cases.)# your code here
## calculate daily new death
## calculate monthly new death by state
## calculate state population
## join state monthly death with state population
## calculate monthly new_death_per100k by state
## `usmap()` recognizes `state` instead of `State` so need to rename(state = State)
## there is no cases for some states in the first months so there is no data
## we complete the missing months with NA
# covid_monthly %>% complete(state, month, fill = list(new_death_per100k = NA))
## Plot using plot_usmap()
# plot_usmap(
# data = tmp,
# regions = "state",
# values = "new_death_per100k",
# exclude = c("Hawaii", "Alaska"),
# color = "black") +
# scale_fill_continuous(
# low = "white", high = "blue",
# name = "Deaths per 100k", label = scales::comma) +
# labs(title = "State Fatality Rate",
# subtitle = "Continental US States") +
# theme(legend.position = "right") +
# facet_wrap(~ month)plotly to animate the monthly maps in i). Does it
reveal any systematic way to capture the dynamic changes among states?
(Hints: Follow Appendix 3: Plotly heatmap:: in Module 6
regularization lecture to plot the heatmap using plotly.
Use frame argument in add_trace() for
animation. plotly only recognizes abbreviation of state
names. Use
unique(us_map(regions = "states") %>% select(abbr, full))
to get the abbreviation and merge with the data to get state
abbreviation.)We now try to build a good parsimonious model to find possible factors related to death rate on county level. Let us not take time series into account for the moment and use the total number as of Feb 1, 2021.
total_death_per100k as the
total of number of COVID deaths per 100k by Feb 1, 2021. We
suggest to take log transformation as
log_death_rate = log(total_death_per100k + 1). Merge
log_death_rate to county_data for the
following analysis. (Hints: the unique identifier for counties in
county_data and covid_rate is
FIPS.)library(dplyr)
# Create a subset for Feb 1, 2021
covid_county_0201 <- covid_rate %>%
filter(date == "2021-02-01") %>%
mutate(
total_death_per100k = (cum_deaths / TotalPopEst2019) * 100000, # Compute deaths per 100k
log_death_rate = log(total_death_per100k + 1) # Apply log transformation
) %>%
select(FIPS, log_death_rate)
# Merge with county-level demographic data
covid_county_0201 <- left_join(covid_county_0201, county_data, by = "FIPS")log_death_rate
vs. Age65AndOlderPct2010 and report the
summary output. Is Age65AndOlderPct2010 a
significant variable at the .05 level? State what effect
Age65AndOlderPct2010 has on log_death_rate, if
any, according to this model.# Run simple linear regression
model_age <- lm(log_death_rate ~ Age65AndOlderPct2010, data = covid_county_0201)
# Print the summary of the regression
summary(model_age)##
## Call:
## lm(formula = log_death_rate ~ Age65AndOlderPct2010, data = covid_county_0201)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.847 -0.343 0.177 0.557 1.920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.62795 0.06975 66.35 <2e-16 ***
## Age65AndOlderPct2010 0.00750 0.00423 1.77 0.077 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.976 on 3106 degrees of freedom
## Multiple R-squared: 0.00101, Adjusted R-squared: 0.000687
## F-statistic: 3.13 on 1 and 3106 DF, p-value: 0.0767
Age65AndOlderPct2010 has a positive but weak effect on log_death_rate, but the relationship is not statistically significant at the 0.05 level.
PovertyAllAgesPct on top of the variable
Age65AndOlderPct2010 to your linear model. Is
Age65AndOlderPct2010 a significant variable at the .05
level? Give a precise interpretation of the effect of
Age65AndOlderPct2010 on log_death_rate in this
model.#Run multiple linear regression with Age65AndOlderPct2010 and PovertyAllAgesPct
model_age_poverty <- lm(log_death_rate ~ Age65AndOlderPct2010 + PovertyAllAgesPct, data = covid_county_0201)
# Print the summary of the regression
summary(model_age_poverty)##
## Call:
## lm(formula = log_death_rate ~ Age65AndOlderPct2010 + PovertyAllAgesPct,
## data = covid_county_0201)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.147 -0.299 0.173 0.529 2.032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.04523 0.08209 49.28 <2e-16 ***
## Age65AndOlderPct2010 0.01027 0.00413 2.48 0.013 *
## PovertyAllAgesPct 0.03548 0.00280 12.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.952 on 3105 degrees of freedom
## Multiple R-squared: 0.0502, Adjusted R-squared: 0.0496
## F-statistic: 82 on 2 and 3105 DF, p-value: <2e-16
Age65AndOlderPct2010 is not a significant variable at the .05 level. As Age65AndOlderPct2010 increase by 1% the log_death_rate increases by 0.01027.
Age65AndOlderPct2010 are different among (i) and (ii). How
would you explain the difference to a non-statistician?We added PovertyAllAgesPct as an additional predictor.
Now, the model separately accounts for the impact of poverty on death
rates. If poverty and the percentage of elderly people are related
(which they often are), then adding poverty to the model helps to
“isolate” the true effect of Age65AndOlderPct2010 on
log_death_rate. The estimates and 95% CIs for
Age65AndOlderPct2010 changed between lm1 and lm2 because adding
PovertyAllAgesPct in model (ii) controlled for an important factor that
was previously omitted. In model (i), some of poverty’s effect was
mistakenly absorbed into the age coefficient, leading to a weaker
estimate and wider CI due to omitted variable bias. Once poverty was
included in model (ii), the effect of age became clearer, stronger, and
statistically significant. Additionally, the confidence interval
narrowed, meaning we now have a more precise estimate of age’s true
effect on COVID death rates. This demonstrates the importance of
controlling for key confounding variables when interpreting regression
results.
lm(log_death_rate ~ PovertyAllAgesPct * Age65AndOlderPct2010).
Is the interaction effect significant at .05 level? Explain the
Age65AndOlderPct2010 effect (if any).Remember that the same variable can play different roles! Take a
quick look at the variable UrbanInfluenceCode2013, and try
to use this variable in the following analyses wisely. The Urban
Influence Codes divide the 3,143 counties, county equivalents, and
independent cities in the United States into 12 groups. The definition
of UrbanInfluenceCode2013 is as follows.
| Code | Description |
|---|---|
| 1 | In large metro area of 1+ million residents |
| 2 | In small metro area of less than 1 million residents |
| 3 | Micropolitan area adjacent to large metro area |
| 4 | Noncore adjacent to large metro area |
| 5 | Micropolitan area adjacent to small metro area |
| 6 | Noncore adjacent to small metro area and contains a town of at least 2,500 residents |
| 7 | Noncore adjacent to small metro area and does not contain a town of at least 2,500 residents |
| 8 | Micropolitan area not adjacent to a metro area |
| 9 | Noncore adjacent to micro area and contains a town of at least 2,500 residents |
| 10 | Noncore adjacent to micro area and does not contain a town of at least 2,500 residents |
| 11 | Noncore not adjacent to metro or micro area and contains a town of at least 2,500 residents |
| 12 | Noncore not adjacent to metro or micro area and does not contain a town of at least 2,500 residents |
As you can see, the larger UrbanInfluenceCode2013 is,
the small population the county has. We can interpret
UrbanInfluenceCode2013 as either a continuous (numeric)
variable or a categorical variable.
UrbanInfluenceCode2013? (Hint: using table()
or group_by() + summarise())##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 472 765 132 149 245 344 164 269 184 189 125 184
UrbanInfluenceCode2013 as a
continuous/numeric variable. Is UrbanInfluenceCode2013
significant at the 0.05 level? What effect does
UrbanInfluenceCode2013 have on log_death_rate
in this model?# Fit linear model treating UrbanInfluenceCode2013 as continuous
model_urban_cont <- lm(log_death_rate ~ UrbanInfluenceCode2013, data = covid_county_0201)
# Print summary of the regression
summary(model_urban_cont)##
## Call:
## lm(formula = log_death_rate ~ UrbanInfluenceCode2013, data = covid_county_0201)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.770 -0.351 0.172 0.559 2.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.78438 0.03163 151.2 <2e-16 ***
## UrbanInfluenceCode2013 -0.00706 0.00504 -1.4 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.976 on 3106 degrees of freedom
## Multiple R-squared: 0.000632, Adjusted R-squared: 0.00031
## F-statistic: 1.96 on 1 and 3106 DF, p-value: 0.161
Treated as a continuous variable, Urban Influence Code does not have a significant effect on log_death_rate (p-value = 0.161 > 0.05). This suggests that the urban alone is not a strong predictor of death. The negative coefficient means that as the Urban Influence Code increases (moving from urban to more rural counties), the log death rate slightly decreases. However, this effect is very small in magnitude.
UrbanInfluenceCode2013 as a
categorical/factor. Is UrbanInfluenceCode2013 significant
at the .01 level? What is the effect of
UrbanInfluenceCode2013 in this model? Describe the
UrbanInfluenceCode2013 effect over
log_death_rate. (REMEMBER to use Anova() from the
car package to test a categorical variable as a
whole!)library(car)
# Fit linear model treating UrbanInfluenceCode2013 as categorical
model_urban_cat <- lm(log_death_rate ~ factor(UrbanInfluenceCode2013), data = covid_county_0201)
# Perform ANOVA to test the overall significance
Anova(model_urban_cat, type = "III") # Type III ANOVA for categorical variable significance## Anova Table (Type III tests)
##
## Response: log_death_rate
## Sum Sq Df F value Pr(>F)
## (Intercept) 9150 1 9926.0 <2e-16 ***
## factor(UrbanInfluenceCode2013) 108 11 10.7 <2e-16 ***
## Residuals 2854 3096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
UrbanInfluenceCode2013 is significant at the 0.01 level (p < 2e-16). When treated as a categorical variable, the effect became highly significant, indicating that urban influence categories have distinct effects on death rates.
UrbanInfluenceCode2013 as a continuous and categorical
variable in your models?Treating UrbanInfluenceCode2013 as a continuous variable assumes a linear relationship between urban influence and log_death_rate, meaning each unit increase has the same effect. However, the result was not statistically significant (p = 0.161), suggesting this assumption may not hold.
When treated as a categorical variable, each level is analyzed separately, capturing nonlinear patterns. This model was highly significant (p < 2e-16), revealing that some rural areas (e.g., codes 6 and 9) had notably higher death rates. The categorical approach provides more detailed insights into how different levels of urban influence impact death rates.
log_death_rate is linear in
UrbanInfluenceCode2013 vs. fit1:
log_death_rate relates to
UrbanInfluenceCode2013 as a categorical variable at .01
level?Select possible variables in county_data as
covariates. We provide county_data_sub, a subset variables
from county_data, for you to get started. Please add any
potential variables as you wish.
Report missing values in your final subset of variables.
In the following anaylsis, you may ignore the missing values.
county_data_sub <- county_data %>%
select(FIPS, County, State, Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc, UnempRate2019,
PctEmpFIRE, PctEmpConstruction, PctEmpTrans, PctEmpMining, PctEmpTrade, PctEmpInformation,
PctEmpAgriculture, PctEmpManufacturing, PctEmpServices, PopDensity2010, OwnHomePct,
Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010, Ed2HSDiplomaOnlyPct,
Ed3SomeCollegePct, Ed4AssocDegreePct, Ed5CollegePlusPct, ForeignBornPct,
Net_International_Migration_Rate_2010_2019, NetMigrationRate1019, NaturalChangeRate1019,
TotalPopEst2019, WhiteNonHispanicPct2010, NativeAmericanNonHispanicPct2010,
BlackNonHispanicPct2010, AsianNonHispanicPct2010, HispanicPct2010, Type_2015_Update,
RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, Perpov_1980_0711, HiCreativeClass2000,
HiAmenity, Retirement_Destination_2015_Update)
# Check missing values in the selected dataset
missing_values <- county_data_sub %>%
summarise_all(~ sum(is.na(.)))
# Display missing values per column
missing_values %>%
gather(key = "Variable", value = "Missing_Count") %>%
arrange(desc(Missing_Count)) %>%
filter(Missing_Count > 0)## Variable Missing_Count
## 1 HiAmenity 172
## 2 HiCreativeClass2000 140
## 3 Type_2015_Update 136
## 4 Perpov_1980_0711 136
## 5 Retirement_Destination_2015_Update 136
## 6 PovertyAllAgesPct 86
## 7 ForeignBornPct 85
## 8 Net_International_Migration_Rate_2010_2019 85
## 9 NetMigrationRate1019 85
## 10 NaturalChangeRate1019 85
## 11 RuralUrbanContinuumCode2013 57
## 12 UrbanInfluenceCode2013 57
## 13 UnempRate2019 7
## 14 PctEmpFIRE 7
## 15 PctEmpConstruction 7
## 16 PctEmpTrans 7
## 17 PctEmpMining 7
## 18 PctEmpTrade 7
## 19 PctEmpInformation 7
## 20 PctEmpAgriculture 7
## 21 PctEmpManufacturing 7
## 22 PctEmpServices 7
## 23 Deep_Pov_All 6
## 24 PerCapitaInc 6
## 25 PopDensity2010 6
## 26 OwnHomePct 6
## 27 Age65AndOlderPct2010 6
## 28 TotalPop25Plus 6
## 29 Under18Pct2010 6
## 30 Ed2HSDiplomaOnlyPct 6
## 31 Ed3SomeCollegePct 6
## 32 Ed4AssocDegreePct 6
## 33 Ed5CollegePlusPct 6
## 34 TotalPopEst2019 6
## 35 WhiteNonHispanicPct2010 6
## 36 NativeAmericanNonHispanicPct2010 6
## 37 BlackNonHispanicPct2010 6
## 38 AsianNonHispanicPct2010 6
## 39 HispanicPct2010 6
State? You may use
lambda.1se to choose a smaller model. (REMEMBER
NEVER include unique identifiers (i.e., FIPS,
County) in your model!)## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
# Step 1: Merge log_death_rate from covid_county_0201 into county_data_sub
county_data_merged <- left_join(county_data_sub, covid_county_0201 %>% select(FIPS, log_death_rate), by = "FIPS")
# Step 2: get X matrix and y
# Define target variable y
y <- county_data_merged$log_death_rate
# Define predictor variables X (excluding FIPS, County, and log_death_rate)
X <- model.matrix(log_death_rate ~ . - FIPS - County, data = county_data_merged)[, -1] # Remove intercept
# grepl(): use regular expression to locate column with some pattern
# Step 3: Identify State variables for forcing
# create indicators: names include State as 0; otherwise 1
state_ind <- ifelse(grepl("State", colnames(X)), 0, 1) # or we can hard code
# using cv.glmnet() with penalty.factor to force in State variables
# Step 4: Run LASSO regression with cross-validation
set.seed(10)
fit.lasso <- cv.glmnet(X, y,
alpha = 1, # LASSO penalty
penalty.factor = state_ind, # Force State into model
nfolds = 10) # 10-fold cross-validation
plot(fit.lasso)# Step 5: Extract the selected coefficients
coef.1se <- coef(fit.lasso, s = "lambda.1se") # Get coefficients at lambda.1se
coef.1se <- coef.1se[which(coef.1se != 0), ] # Keep only non-zero coefficients
# Step 6: Retrieve the names of the selected variables
var.1se <- rownames(as.matrix(coef.1se))[-1] # Extract variable names, excluding intercept
var.1se## [1] "StateAR" "StateAZ"
## [3] "StateCA" "StateCO"
## [5] "StateCT" "StateDC"
## [7] "StateDE" "StateFL"
## [9] "StateGA" "StateIA"
## [11] "StateID" "StateIL"
## [13] "StateIN" "StateKS"
## [15] "StateKY" "StateLA"
## [17] "StateMA" "StateMD"
## [19] "StateME" "StateMI"
## [21] "StateMN" "StateMO"
## [23] "StateMS" "StateMT"
## [25] "StateNC" "StateND"
## [27] "StateNE" "StateNH"
## [29] "StateNJ" "StateNM"
## [31] "StateNV" "StateNY"
## [33] "StateOH" "StateOK"
## [35] "StateOR" "StatePA"
## [37] "StateRI" "StateSC"
## [39] "StateSD" "StateTN"
## [41] "StateTX" "StateUT"
## [43] "StateVA" "StateVT"
## [45] "StateWA" "StateWI"
## [47] "StateWV" "StateWY"
## [49] "PovertyAllAgesPct" "PerCapitaInc"
## [51] "PctEmpConstruction" "PctEmpMining"
## [53] "PctEmpTrade" "PctEmpAgriculture"
## [55] "PctEmpManufacturing" "PopDensity2010"
## [57] "Age65AndOlderPct2010" "Under18Pct2010"
## [59] "Ed3SomeCollegePct" "Ed5CollegePlusPct"
## [61] "NetMigrationRate1019" "NaturalChangeRate1019"
## [63] "WhiteNonHispanicPct2010" "HispanicPct2010"
## [65] "Type_2015_Update"
We force in State is because we want always include State.
library(car) # Anova()
library(coefplot) # coefficient plot
# Step 1: Fit an OLS model using the LASSO-selected variables
formula <- as.formula(paste("log_death_rate ~", paste(var.1se, collapse = " + ")))
# Convert State into dummy variables
state_dummies <- model.matrix(~ State - 1, data = county_data_merged)
# Remove the original State column and bind dummy variables
county_data_final <- cbind(county_data_merged %>% select(-State), state_dummies)
fit.1se <- lm(formula, data = county_data_final)
summary(fit.1se)##
## Call:
## lm(formula = formula, data = county_data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.783 -0.251 0.067 0.375 3.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.42e+00 4.34e-01 10.20 < 2e-16 ***
## StateAR 5.61e-02 1.35e-01 0.42 0.67753
## StateAZ 2.76e-01 2.39e-01 1.16 0.24755
## StateCA -7.63e-01 1.59e-01 -4.80 1.6e-06 ***
## StateCO -3.52e-01 1.53e-01 -2.30 0.02125 *
## StateCT 1.84e-01 3.03e-01 0.61 0.54367
## StateDC 5.60e-01 8.09e-01 0.69 0.48875
## StateDE -2.26e-02 4.69e-01 -0.05 0.96148
## StateFL 3.56e-02 1.47e-01 0.24 0.80849
## StateGA 6.71e-02 1.17e-01 0.58 0.56470
## StateIA 2.86e-01 1.34e-01 2.14 0.03236 *
## StateID -3.14e-01 1.67e-01 -1.89 0.05932 .
## StateIL 2.14e-01 1.33e-01 1.62 0.10575
## StateIN -6.49e-02 1.34e-01 -0.48 0.62778
## StateKS -6.20e-01 1.38e-01 -4.50 6.9e-06 ***
## StateKY -7.48e-01 1.28e-01 -5.83 6.3e-09 ***
## StateLA 3.77e-01 1.43e-01 2.64 0.00831 **
## StateMA 8.35e-02 2.40e-01 0.35 0.72831
## StateMD -4.64e-02 1.97e-01 -0.24 0.81350
## StateME -1.37e+00 2.25e-01 -6.09 1.2e-09 ***
## StateMI -1.42e-01 1.35e-01 -1.05 0.29368
## StateMN -1.55e-01 1.38e-01 -1.12 0.26121
## StateMO -3.74e-01 1.29e-01 -2.91 0.00365 **
## StateMS 2.64e-01 1.32e-01 2.00 0.04543 *
## StateMT 5.55e-01 1.58e-01 3.51 0.00045 ***
## StateNC -3.66e-01 1.27e-01 -2.89 0.00392 **
## StateND 9.52e-01 1.66e-01 5.73 1.1e-08 ***
## StateNE -3.85e-01 1.43e-01 -2.70 0.00694 **
## StateNH -8.10e-01 2.73e-01 -2.97 0.00305 **
## StateNJ 3.29e-01 2.09e-01 1.58 0.11529
## StateNM -6.42e-01 1.92e-01 -3.33 0.00087 ***
## StateNV -5.64e-01 2.28e-01 -2.48 0.01332 *
## StateNY -3.59e-01 1.51e-01 -2.38 0.01727 *
## StateOH -5.16e-01 1.35e-01 -3.83 0.00013 ***
## StateOK -3.69e-01 1.38e-01 -2.68 0.00737 **
## StateOR -8.57e-01 1.74e-01 -4.92 9.1e-07 ***
## StatePA 4.39e-02 1.46e-01 0.30 0.76373
## StateRI -1.72e-01 3.72e-01 -0.46 0.64468
## StateSC -1.29e-02 1.53e-01 -0.08 0.93259
## StateSD 8.13e-01 1.51e-01 5.39 7.5e-08 ***
## StateTN 6.49e-02 1.30e-01 0.50 0.61867
## StateTX 1.74e-01 1.27e-01 1.37 0.17228
## StateUT -1.08e+00 1.96e-01 -5.50 4.1e-08 ***
## StateVA -4.72e-01 1.23e-01 -3.83 0.00013 ***
## StateVT -2.14e+00 2.39e-01 -8.96 < 2e-16 ***
## StateWA -8.50e-01 1.69e-01 -5.04 5.0e-07 ***
## StateWI -1.31e-01 1.40e-01 -0.93 0.35198
## StateWV -6.71e-01 1.56e-01 -4.31 1.7e-05 ***
## StateWY 2.22e-01 2.06e-01 1.08 0.28099
## PovertyAllAgesPct 2.90e-03 5.14e-03 0.56 0.57295
## PerCapitaInc -5.42e-06 5.80e-06 -0.94 0.34966
## PctEmpConstruction -4.71e-02 7.42e-03 -6.35 2.5e-10 ***
## PctEmpMining -1.78e-02 5.80e-03 -3.07 0.00213 **
## PctEmpTrade 5.61e-03 6.11e-03 0.92 0.35803
## PctEmpAgriculture -3.84e-02 3.70e-03 -10.38 < 2e-16 ***
## PctEmpManufacturing 4.02e-03 3.46e-03 1.16 0.24580
## PopDensity2010 2.58e-05 9.33e-06 2.77 0.00572 **
## Age65AndOlderPct2010 3.34e-02 7.66e-03 4.36 1.4e-05 ***
## Under18Pct2010 6.01e-02 7.83e-03 7.67 2.2e-14 ***
## Ed3SomeCollegePct -2.09e-02 5.48e-03 -3.82 0.00014 ***
## Ed5CollegePlusPct -1.08e-02 3.87e-03 -2.80 0.00516 **
## NetMigrationRate1019 -1.19e-02 2.57e-03 -4.61 4.3e-06 ***
## NaturalChangeRate1019 -4.47e-02 9.97e-03 -4.48 7.8e-06 ***
## WhiteNonHispanicPct2010 -3.02e-03 1.66e-03 -1.82 0.06946 .
## HispanicPct2010 8.69e-03 2.18e-03 3.99 6.8e-05 ***
## Type_2015_Update -1.87e-02 8.66e-03 -2.16 0.03098 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.79 on 3039 degrees of freedom
## Multiple R-squared: 0.359, Adjusted R-squared: 0.345
## F-statistic: 26.2 on 65 and 3039 DF, p-value: <2e-16
# Step 2: Perform Backward Elimination
# Iteratively remove non-significant variables (p > 0.05), but keep State
stepwise_model <- step(fit.1se, direction = "backward", trace = FALSE)
# Step 3: Ensure `StateXX` variables remain in the model (if dropped, refit including them)
final_vars <- names(coef(stepwise_model)) # Get variables actually in the model
state_vars <- grep("State", colnames(county_data_final), value = TRUE) # All State dummies
# If none of the StateXX variables remain in the selected model, add them back
if (!any(final_vars %in% state_vars)) {
formula_final <- as.formula(paste("log_death_rate ~", paste(c(final_vars, state_vars), collapse = " + ")))
final_model <- lm(formula_final, data = county_data_final)
} else {
final_model <- stepwise_model # Keep model as is
}
# Step 4: Use Type III ANOVA for Categorical Variables
Anova(final_model, type = "III")## Anova Table (Type III tests)
##
## Response: log_death_rate
## Sum Sq Df F value Pr(>F)
## (Intercept) 205 1 328.58 < 2e-16 ***
## StateCA 37 1 58.73 2.4e-14 ***
## StateCO 11 1 17.93 2.4e-05 ***
## StateIA 5 1 7.29 0.00698 **
## StateID 6 1 10.09 0.00151 **
## StateIL 2 1 3.04 0.08124 .
## StateKS 38 1 60.88 8.3e-15 ***
## StateKY 58 1 93.72 < 2e-16 ***
## StateLA 5 1 8.60 0.00338 **
## StateME 31 1 50.46 1.5e-12 ***
## StateMI 3 1 4.32 0.03772 *
## StateMN 3 1 4.96 0.02596 *
## StateMO 17 1 27.00 2.2e-07 ***
## StateMS 3 1 5.25 0.02197 *
## StateMT 9 1 14.60 0.00014 ***
## StateNC 16 1 25.63 4.4e-07 ***
## StateND 29 1 46.76 9.6e-12 ***
## StateNE 14 1 22.11 2.7e-06 ***
## StateNH 7 1 11.41 0.00074 ***
## StateNJ 1 1 2.04 0.15371
## StateNM 18 1 28.34 1.1e-07 ***
## StateNV 8 1 12.78 0.00036 ***
## StateNY 9 1 15.02 0.00011 ***
## StateOH 23 1 36.22 2.0e-09 ***
## StateOK 14 1 22.61 2.1e-06 ***
## StateOR 29 1 46.30 1.2e-11 ***
## StateSD 28 1 44.41 3.2e-11 ***
## StateUT 32 1 52.19 6.3e-13 ***
## StateVA 32 1 50.87 1.2e-12 ***
## StateVT 64 1 102.32 < 2e-16 ***
## StateWA 31 1 50.39 1.6e-12 ***
## StateWI 2 1 2.90 0.08883 .
## StateWV 25 1 40.69 2.1e-10 ***
## PerCapitaInc 1 1 2.30 0.12982
## PctEmpConstruction 32 1 52.12 6.6e-13 ***
## PctEmpMining 7 1 11.89 0.00057 ***
## PctEmpAgriculture 90 1 144.03 < 2e-16 ***
## PopDensity2010 5 1 7.59 0.00592 **
## Age65AndOlderPct2010 13 1 20.34 6.7e-06 ***
## Under18Pct2010 40 1 64.98 1.1e-15 ***
## Ed3SomeCollegePct 10 1 16.82 4.2e-05 ***
## Ed5CollegePlusPct 7 1 11.86 0.00058 ***
## NetMigrationRate1019 15 1 24.05 9.9e-07 ***
## NaturalChangeRate1019 14 1 21.94 2.9e-06 ***
## WhiteNonHispanicPct2010 4 1 6.16 0.01309 *
## HispanicPct2010 20 1 32.66 1.2e-08 ***
## Type_2015_Update 3 1 5.33 0.02101 *
## Residuals 1904 3058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Step 5: Plot Coefficients of the Final Model
coefplot(final_model, intercept = FALSE, interactive = TRUE)# Load necessary libraries
library(ggplot2)
library(car) # For VIF and ANOVA
library(lmtest) # For Breusch-Pagan and Durbin-Watson tests## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:data.table':
##
## yearmon, yearqtr
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Step 1: Check Linearity with Residual vs. Fitted Plot
plot(final_model$fitted.values, resid(final_model),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")# Step 2: Check Normality of Residuals
# Q-Q Plot
qqPlot(final_model, main = "Q-Q Plot of Residuals")## [1] 488 1593
##
## Shapiro-Wilk normality test
##
## data: resid(final_model)
## W = 0.8, p-value <2e-16
# Step 3: Check Homoscedasticity (Constant Variance)
# Scale-Location Plot
plot(final_model, which = 3)##
## studentized Breusch-Pagan test
##
## data: final_model
## BP = 555, df = 46, p-value <2e-16
# Step 4: Check Independence of Residuals
# Durbin-Watson Test for Autocorrelation
dwtest(final_model)##
## Durbin-Watson test
##
## data: final_model
## DW = 2, p-value = 0.03
## alternative hypothesis: true autocorrelation is greater than 0
# Step 5: Check Multicollinearity with Variance Inflation Factor (VIF)
vif_values <- vif(final_model)
print(vif_values)## StateCA StateCO StateIA
## 1.21 1.18 1.22
## StateID StateIL StateKS
## 1.18 1.20 1.36
## StateKY StateLA StateME
## 1.22 1.15 1.03
## StateMI StateMN StateMO
## 1.14 1.21 1.21
## StateMS StateMT StateNC
## 1.20 1.26 1.10
## StateND StateNE StateNH
## 1.37 1.41 1.03
## StateNJ StateNM StateNV
## 1.09 1.16 1.07
## StateNY StateOH StateOK
## 1.17 1.16 1.12
## StateOR StateSD StateUT
## 1.12 1.31 1.24
## StateVA StateVT StateWA
## 1.16 1.04 1.10
## StateWI StateWV PerCapitaInc
## 1.12 1.16 4.46
## PctEmpConstruction PctEmpMining PctEmpAgriculture
## 1.27 1.49 2.16
## PopDensity2010 Age65AndOlderPct2010 Under18Pct2010
## 1.27 4.83 3.08
## Ed3SomeCollegePct Ed5CollegePlusPct NetMigrationRate1019
## 1.59 5.04 1.61
## NaturalChangeRate1019 WhiteNonHispanicPct2010 HispanicPct2010
## 6.14 3.10 2.76
## Type_2015_Update
## 1.18
# Identify problematic variables (VIF > 5 indicates severe multicollinearity)
high_vif <- names(vif_values)[vif_values > 5]
if (length(high_vif) > 0) {
print(paste("Variables with high multicollinearity:", paste(high_vif, collapse = ", ")))
} else {
print("No serious multicollinearity detected.")
}## [1] "Variables with high multicollinearity: Ed5CollegePlusPct, NaturalChangeRate1019"
Based on the plots, final model does not met the linear model assumptions.
##
## Call:
## lm(formula = log_death_rate ~ StateCA + StateCO + StateIA + StateID +
## StateIL + StateKS + StateKY + StateLA + StateME + StateMI +
## StateMN + StateMO + StateMS + StateMT + StateNC + StateND +
## StateNE + StateNH + StateNJ + StateNM + StateNV + StateNY +
## StateOH + StateOK + StateOR + StateSD + StateUT + StateVA +
## StateVT + StateWA + StateWI + StateWV + PerCapitaInc + PctEmpConstruction +
## PctEmpMining + PctEmpAgriculture + PopDensity2010 + Age65AndOlderPct2010 +
## Under18Pct2010 + Ed3SomeCollegePct + Ed5CollegePlusPct +
## NetMigrationRate1019 + NaturalChangeRate1019 + WhiteNonHispanicPct2010 +
## HispanicPct2010 + Type_2015_Update, data = county_data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.734 -0.254 0.061 0.376 3.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.68e+00 2.58e-01 18.13 < 2e-16 ***
## StateCA -8.83e-01 1.15e-01 -7.66 2.4e-14 ***
## StateCO -4.63e-01 1.09e-01 -4.23 2.4e-05 ***
## StateIA 2.41e-01 8.91e-02 2.70 0.00698 **
## StateID -4.13e-01 1.30e-01 -3.18 0.00151 **
## StateIL 1.53e-01 8.76e-02 1.74 0.08124 .
## StateKS -7.13e-01 9.14e-02 -7.80 8.3e-15 ***
## StateKY -7.85e-01 8.11e-02 -9.68 < 2e-16 ***
## StateLA 3.14e-01 1.07e-01 2.93 0.00338 **
## StateME -1.43e+00 2.01e-01 -7.10 1.5e-12 ***
## StateMI -1.95e-01 9.38e-02 -2.08 0.03772 *
## StateMN -2.11e-01 9.45e-02 -2.23 0.02596 *
## StateMO -4.28e-01 8.23e-02 -5.20 2.2e-07 ***
## StateMS 2.22e-01 9.66e-02 2.29 0.02197 *
## StateMT 4.57e-01 1.20e-01 3.82 0.00014 ***
## StateNC -4.26e-01 8.41e-02 -5.06 4.4e-07 ***
## StateND 8.76e-01 1.28e-01 6.84 9.6e-12 ***
## StateNE -4.64e-01 9.86e-02 -4.70 2.7e-06 ***
## StateNH -8.56e-01 2.53e-01 -3.38 0.00074 ***
## StateNJ 2.57e-01 1.80e-01 1.43 0.15371
## StateNM -8.04e-01 1.51e-01 -5.32 1.1e-07 ***
## StateNV -7.10e-01 1.98e-01 -3.58 0.00036 ***
## StateNY -4.24e-01 1.09e-01 -3.88 0.00011 ***
## StateOH -5.53e-01 9.19e-02 -6.02 2.0e-09 ***
## StateOK -4.58e-01 9.64e-02 -4.75 2.1e-06 ***
## StateOR -9.52e-01 1.40e-01 -6.80 1.2e-11 ***
## StateSD 7.48e-01 1.12e-01 6.66 3.2e-11 ***
## StateUT -1.19e+00 1.64e-01 -7.22 6.3e-13 ***
## StateVA -5.38e-01 7.54e-02 -7.13 1.2e-12 ***
## StateVT -2.19e+00 2.16e-01 -10.12 < 2e-16 ***
## StateWA -9.45e-01 1.33e-01 -7.10 1.6e-12 ***
## StateWI -1.70e-01 9.97e-02 -1.70 0.08883 .
## StateWV -7.38e-01 1.16e-01 -6.38 2.1e-10 ***
## PerCapitaInc -7.02e-06 4.63e-06 -1.52 0.12982
## PctEmpConstruction -4.86e-02 6.73e-03 -7.22 6.6e-13 ***
## PctEmpMining -1.72e-02 4.98e-03 -3.45 0.00057 ***
## PctEmpAgriculture -3.95e-02 3.29e-03 -12.00 < 2e-16 ***
## PopDensity2010 2.53e-05 9.19e-06 2.75 0.00592 **
## Age65AndOlderPct2010 3.40e-02 7.53e-03 4.51 6.7e-06 ***
## Under18Pct2010 6.07e-02 7.52e-03 8.06 1.1e-15 ***
## Ed3SomeCollegePct -1.89e-02 4.61e-03 -4.10 4.2e-05 ***
## Ed5CollegePlusPct -1.16e-02 3.38e-03 -3.44 0.00058 ***
## NetMigrationRate1019 -1.16e-02 2.37e-03 -4.90 9.9e-07 ***
## NaturalChangeRate1019 -4.56e-02 9.74e-03 -4.68 2.9e-06 ***
## WhiteNonHispanicPct2010 -3.17e-03 1.28e-03 -2.48 0.01309 *
## HispanicPct2010 1.02e-02 1.78e-03 5.71 1.2e-08 ***
## Type_2015_Update -1.96e-02 8.49e-03 -2.31 0.02101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.789 on 3058 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.347
## F-statistic: 36.9 on 46 and 3058 DF, p-value: <2e-16
Elderly and Hispanic populations experienced significantly higher COVID death rates, with a 1% increase in the elderly population leading to a 3.4% rise in mortality and a 1% increase in the Hispanic population resulting in a 1.02% rise. The impact on Black populations remains inconclusive as the variable was not retained in the final model.
State effects: Many states have significant coefficients, indicating that the COVID-19 death rate varies by state even when controlling for other factors. Some states (e.g., Vermont, Kentucky, Montana) show highly negative coefficients, suggesting lower death rates, while others have positive effects.
Age factor: The coefficient for the percentage of the population aged 65 and older is positive and highly significant (p < 0.001), supporting the argument that COVID-19 affects the elderly more.
Economic and socioeconomic factors:
A few way to improve the model: - Transforming Variables: Applying a log or other transformation (e.g., square root) can address non-linearity and heteroscedasticity by stabilizing variance. - Adding Polynomial Terms. - Using Robust Regression: When variance is non-constant or outliers heavily influence results, methods like White’s robust standard errors or weighted least squares can yield more reliable inferences.
The objective of this study is to identify key socioeconomic and demographic factors that contribute to county-level variations in COVID-19 death rates. By leveraging statistical modeling techniques, we aim to provide data-driven insights for policymakers to mitigate the impact of COVID-19.
The dataset includes county-level COVID-19 mortality rates and various socioeconomic indicators collected from multiple sources: - COVID-19 death rate data from public health reports. - Demographic and socioeconomic data from the U.S. Census Bureau. - Economic indicators such as income, employment, and industry distribution from federal databases.
The datasets were merged based on the FIPS county code to align COVID-19 mortality rates with corresponding socioeconomic data. Missing values were addressed by either removing incomplete records or imputing values where necessary.
This analysis highlights the significant role of age, race, economic conditions, and state policies in determining COVID-19 death rates. Policymakers should focus on targeted interventions such as increasing healthcare access for vulnerable populations, improving support for elderly communities, and addressing disparities in Hispanic communities.
A detailed summary of the variables in each data set follows:
Income: Poverty level and household income
Jobs: Employment type, rate, and change
UnempRate2007-2019: Unemployment rate, 2007-2019
NumEmployed2007-2019: Employed, 2007-2019
NumUnemployed2007-2019: Unemployed, 2007-2019
PctEmpChange1019: Percent employment change, 2010-19
PctEmpChange1819: Percent employment change, 2018-19
PctEmpChange0719: Percent employment change, 2007-19
PctEmpChange0710: Percent employment change, 2007-10
NumCivEmployed: Civilian employed population 16 years and over, 2014-18
NumCivLaborforce2007-2019: Civilian labor force, 2007-2019
PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18
PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18
PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18
PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18
PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18
PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18
PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18
PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18
PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18
PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18
People: Population size, density, education level, race, age, household size, and migration rates
PopDensity2010: Population density, 2010
LandAreaSQMiles2010: Land area in square miles, 2010
TotalHH: Total number of households, 2014-18
TotalOccHU: Total number of occupied housing units, 2014-18
AvgHHSize: Average household size, 2014-18
OwnHomeNum: Number of owner occupied housing units, 2014-18
OwnHomePct: Percent of owner occupied housing units, 2014-18
NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18
HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18
FemaleHHPct: Percent of female headed family households of total households, 2014-18
FemaleHHNum: Number of female headed family households, 2014-18
NonEnglishHHNum: Number of non-English speaking households, 2014-18
HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18
Age65AndOlderPct2010: Percent of population 65 or older, 2010
Age65AndOlderNum2010: Population 65 years or older, 2010
TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average
Under18Pct2010: Percent of population under age 18, 2010
Under18Num2010: Population under age 18, 2010
Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18
Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18
Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18
Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18
Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18
Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18
ForeignBornPct: Percent of total population foreign born, 2014-18
ForeignBornEuropePct: Percent of persons born in Europe, 2014-18
ForeignBornMexPct: Percent of persons born in Mexico, 2014-18
ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18
ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18
ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18
ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18
ForeignBornNum: Number of people foreign born, 2014-18
ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18
ForeignBornEuropeNum: Number of persons born in Europe, 2014-18
ForeignBornMexNum: Number of persons born in Mexico, 2014-18
ForeignBornAfricaNum: Number of persons born in Africa, 2014-18
ForeignBornAsiaNum: Number of persons born in Asia, 2014-18
ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18
Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19
Net_International_Migration_2010_2019: Net international migration, 2010-19
Net_International_Migration_2000_2010: Net international migration, 2000-10
Immigration_Rate_2000_2010: Net international migration rate, 2000-10
NetMigrationRate0010: Net migration rate, 2000-10
NetMigrationRate1019: Net migration rate, 2010-19
NetMigrationNum0010: Net migration, 2000-10
NetMigration1019: Net Migration, 2010-19
NaturalChangeRate1019: Natural population change rate, 2010-19
NaturalChangeRate0010: Natural population change rate, 2000-10
NaturalChangeNum0010: Natural change, 2000-10
NaturalChange1019: Natural population change, 2010-19
TotalPop2010: Population size 4/1/2010 Census
TotalPopEst2010: Population size 7/1/2010
TotalPopEst2011: Population size 7/1/2011
TotalPopEst2012: Population size 7/1/2012
TotalPopEst2013: Population size 7/1/2013
TotalPopEst2014: Population size 7/1/2014
TotalPopEst2015: Population size 7/1/2015
TotalPopEst2016: Population size 7/1/2016
TotalPopEst2017: Population size 7/1/2017
TotalPopEst2018: Population size 7/1/2018
TotalPopEst2019: Population size 7/1/2019
TotalPopACS: Total population, 2014-18 - 5-year average
TotalPopEstBase2010: County Population estimate base 4/1/2010
NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10
PopChangeRate1819: Population change rate, 2018-19
PopChangeRate1019: Population change rate, 2010-19
PopChangeRate0010: Population change rate, 2000-10
NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10
HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10
MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10
NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10
NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10
MultipleRacePct2010: Percent multiple race, 2010
WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010
NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010
BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010
AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010
HispanicPct2010: Percent Hispanic, 2010
MultipleRaceNum2010: Population size multiple race, 2010
WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010
BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010
NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010
AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010
HispanicNum2010: Population size Hispanic
Type of county (rural or urban on a rural-urban continuum scale)
Type_2015_Recreation_NO: Recreation counties, 2015 edition
Type_2015_Farming_NO: Farming-dependent counties, 2015 edition
Type_2015_Mining_NO: Mining-dependent counties, 2015 edition
Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition
Type_2015_Update: County typology economic types, 2015 edition
Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition
Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition
RecreationDependent2000: Nonmetro recreation-dependent, 1997-00
ManufacturingDependent2000: Manufacturing-dependent, 1998-00
FarmDependent2003: Farm-dependent, 1998-00
EconomicDependence2000: Economic dependence, 1998-00
RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003
UrbanInfluenceCode2003: Urban influence code, 2003
RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013
UrbanInfluenceCode2013: Urban influence code, 2013
Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013
Micropolitan2013: Micropolitan, 2013
Nonmetro2013: Nonmetro, 2013
Metro2013: Metro, 2013
Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013
Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003
Micropolitan2003: Micropolitan, 2003
Metro2003: Metro, 2003
Nonmetro2003: Nonmetro, 2003
NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003
NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003
Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11
Gas_Change: Change in the value of onshore natural gas production, 2000-11
Oil_Change: Change in the value of onshore oil production, 2000-11
Hipov: High poverty counties, 2014-18
Perpov_1980_0711: Persistent poverty counties, 2015 edition
PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition
PersistentChildPoverty2004: Persistent child poverty counties, 2004
PersistentPoverty2000: Persistent poverty counties, 2004
Low_Education_2015_update: Low education counties, 2015 edition
LowEducation2000: Low education, 2000
HiCreativeClass2000: Creative class, 2000
HiAmenity: High natural amenities
RetirementDestination2000: Retirement destination, 1990-00
Low_Employment_2015_update: Low employment counties, 2015 edition
Population_loss_2015_update: Population loss counties, 2015 edition
Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition
The raw data sets are dirty and need transforming before we can do
our EDA. It takes time and efforts to clean and merge different data
sources so we provide the final output of the cleaned and merged data.
The cleaning procedure is as follows. Please read through to understand
what is in the cleaned data. We set eval = data_cleaned in
the following cleaning chunks so that these cleaning chunks will only
run if any of data/covid_county.csv,
data/covid_rates.csv or
data/covid_intervention.csv does not exist.
# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") &
file.exists("data/covid_rates.csv") &
file.exists("data/covid_intervention.csv"))We first read in the table using data.table::fread(), as
we did last time.
# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))
# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))
# Internvention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))The original NYC data contains more information than we need. We
extract only the number of cases and deaths and format it the same as
the covid_rates data.
# NYC county fips matching table
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
County = c("BX", "BK", "MN", "QN", "SI"))
# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_CASE_COUNT,
BK = BK_CASE_COUNT,
MN = MN_CASE_COUNT,
QN = QN_CASE_COUNT,
SI = SI_CASE_COUNT)]
nyc_case %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "cases") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_cases = cumsum(cases))
# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_DEATH_COUNT,
BK = BK_DEATH_COUNT,
MN = MN_DEATH_COUNT,
QN = QN_DEATH_COUNT,
SI = SI_DEATH_COUNT)]
nyc_death %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "deaths") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_deaths = cumsum(deaths))
nyc_rates <- merge(nyc_case,
nyc_death,
by = c("date", "County"),
all.x= T)
nyc_rates <- merge(nyc_rates,
nyc_fips,
by = "County")
nyc_rates$State <- "New York"
nyc_rates %<>%
select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
arrange(FIPS, date)We only consider cases and death in continental US. Alaska, Hawaii,
and Puerto Rico have 02, 15, and 72 as their respective first 2 digits
of their FIPS. We use the %/% operator for integer division
to get the first 2 digits of FIPS. We also remove Virgin Islands and
Northern Mariana Islands. All data of counties in NYC are aggregated as
County == "New York City" in covid_rates with
no FIPS, so we combine the NYC data into covid_rate.
covid_rates <- covid_rates %>%
arrange(fips, date) %>%
filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
filter(county != "New York City") %>%
filter(!(state %in% c("Virgin Islands", "Northern Mariana Islands"))) %>%
rename(FIPS = "fips",
County = "county",
State = "state",
cum_cases = "cases",
cum_deaths = "deaths")
covid_rates$date <- as.Date(covid_rates$date)
covid_rates <- rbind(covid_rates,
nyc_rates)We set the week of Jan 21, 2020 (the first case of COVID case in US) as the first week (2020-01-19 to 2020-01-25).
Merge the TotalPopEst2019 variable from the demographic
data with covid_rates by FIPS.
NA values in the covid_rates data set correspond to a
county not having confirmed cases/deaths. We replace the NA values in
these columns with zeros. FIPS for Kansas city, Missouri, Rhode Island
and some others are missing. We drop them for the moment and output the
data up to week 57 as covid_rates.csv.
int_datesWe convert the columns representing dates in int_dates
to R Date types using as.Date(). We will need to specify
that the origin parameter is "0001-01-01". We
output the data as covid_intervention.csv.
Merge the demographic data sets by FIPS and output as
covid_county.csv.
countydata <-
merge(x = income,
y = merge(
x = people,
y = jobs,
by = c("FIPS", "State", "County")),
by = c("FIPS", "State", "County"),
all = TRUE)
countydata <-
merge(
x = countydata,
y = county_class %>% rename(FIPS = FIPStxt),
by = c("FIPS", "State", "County"),
all = TRUE
)
# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")